Abstract

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD($\lambda$), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD($\lambda$, $\beta$), where our introduced parameter $\beta$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($\lambda$, $\beta$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($\lambda$, $\beta$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $\beta$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.


1 Introduction 


In Reinforcement Learning (RL; Sutton and Barto, 1998), policy evaluation refers to the problem of evaluating the value function (a mapping from states to their long-term discounted return under a given policy) using sampled observations of the system dynamics and reward. Policy evaluation is important both for assessing the quality of a policy and as a sub-procedure for policy optimization.

For systems with large or continuous state spaces, an exact computation of the value function is often impossible. Instead, an approximate value function is sought using various function-approximation techniques (a.k.a. approximate dynamic programming; Bertsekas, 2012). In this approach, the parameters of the value-function approximation are tuned using machine-learning inspired methods, often based on temporal differences (TD; Sutton and Barto, 1998).

The source generating the sampled data divides policy evaluation into two cases. In the on-policy case, the samples are generated by the target policy, i.e., the policy under evaluation. In the off-policy setting, a different behavior policy generates the data. In the on-policy setting, TD methods are well understood, with classic convergence guarantees and approximation-error bounds, based on a contraction property of the projected Bellman operator underlying TD (Bertsekas and Tsitsiklis, 1996). These bounds guarantee that the asymptotic error, or bias, of the algorithm is contained. For the off-policy case, however, standard TD methods no longer maintain this contraction property, the error bounds do not hold, and these methods might even diverge (Baird, 1995).

The standard error bounds may be shown to hold for an importance-sampling TD method (IS-TD), as proposed by Precup, Sutton, and Dasgupta (2001). However, this method is known to suffer from a high variance of its importance-sampling estimator, limiting its practicality.

Lately, Sutton, Mahmood, and White (2015) proposed the emphatic TD (ETD) algorithm: a modification of the TD idea, which converges off-policy (Yu, 2015), and has a reduced variance compared to IS-TD. This variance reduction is achieved by incorporating a certain decay factor over the importance-sampling ratio. However, to the best of our knowledge, there are no results that bound the bias of ETD. Thus, while ETD is assured to converge, it is not known how good its limit actually is.

In this paper, we propose the ETD($\lambda$, $\beta$) framework, a modification of the ETD($\lambda$) algorithm, where the decay rate of the importance-sampling ratio, $\beta$, is a free parameter, and $\lambda$ is the same bootstrapping parameter employed in TD($\lambda$) and ETD($\lambda$). By varying the decay rate, one can smoothly transition between the IS-TD algorithm, through ETD, to the standard TD algorithm.

We investigate the bias of ETD($\lambda$, $\beta$) by studying the conditions under which its underlying projected Bellman operator is a contraction. We show that the original ETD possesses a contraction property, and present the first error bounds for ETD and ETD($\lambda$, $\beta$). In addition, our error bound reveals that the decay rate parameter balances between the bias and variance of the learning procedure. In particular, we show that selecting a decay equal to the discount factor, as in the original ETD, may be suboptimal in terms of the mean-squared error.

The main contributions of this work are therefore a unification of several off-policy TD algorithms under the ETD($\lambda$, $\beta$) framework, and a new error analysis that reveals the bias-variance trade-off between them.


Related Work: In recent years, several different off-policy policy-evaluation algorithms have been studied, such as importance-sampling based least-squares TD (Yu, 2012), and gradient-based TD (Sutton et al., 2009; Liu et al., 2015). These algorithms are guaranteed to converge; however, their asymptotic error can be bounded only when the target and behavior policies are similar (Bertsekas and Yu, 2009), or when their induced transition matrices satisfy a certain matrix inequality suggested by Kolter (2011), which limits the discrepancy between the target and behavior policies. When these conditions are not satisfied, the error may be arbitrarily large (Kolter, 2011). In contrast, the approximation-error bounds in this paper hold for general target and behavior policies.

2 Preliminaries

We consider an MDP $M = (S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability matrix, $R$ is the reward function, and $\gamma \in [0,1)$ is the discount factor.

Given a target policy $\pi$ mapping states to a distribution over actions, our goal is to evaluate the value function:

$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s\right].$$

Linear temporal difference methods (Sutton and Barto, 1998) approximate the value function by

$$V^\pi(s) \approx \theta^\top \phi(s),$$

where $\phi(s) \in \mathbb{R}^n$ are state features, and $\theta \in \mathbb{R}^n$ are weights, and use sampling to find a suitable $\theta$. Let $\mu$ denote a behavior policy that generates the samples $s_0, a_0, s_1, a_1, \dots$ according to $a_t \sim \mu(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $\rho_t$ the ratio $\pi(a_t|s_t)/\mu(a_t|s_t)$, and we assume, similarly to Sutton, Mahmood, and White (2015), that $\mu$ and $\pi$ are such that $\rho_t$ is well-defined^1 for all $t$.

Let $T$ denote the Bellman operator for policy $\pi$, given by

$$T(V) = R + \gamma P V,$$

where $R$ and $P$ are the reward vector and transition matrix induced by policy $\pi$, and let $\Phi$ denote a matrix whose columns are the feature vectors for all states. Let $d_\mu$ and $d_\pi$ denote the stationary distributions over states induced by the policies $\mu$ and $\pi$, respectively. For some $d \in \mathbb{R}^{|S|}$ satisfying $d > 0$ element-wise, we denote by $\Pi_d$ a projection to the subspace spanned by $\phi(s)$ with respect to the $d$-weighted Euclidean norm.


For $\lambda = 0$, the ETD(0, $\beta$) algorithm (Sutton, Mahmood, and White, 2015) seeks to find a good approximation of the value function by iteratively updating the weight vector $\theta$:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha F_t \rho_t \left(R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right)\phi_t, \\
F_t &= \beta \rho_{t-1} F_{t-1} + 1, \quad F_0 = 1,
\end{aligned}
$$

where $F_t$ is a decaying trace of the importance-sampling ratios, and $\beta \in (0,1)$ controls the decay rate.
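
The update above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `etd0_beta_step` and its argument layout are assumptions made here, and passing `rho_prev = 0` on the first call is simply a convenient way to obtain $F_0 = 1$ from the recursion.

```python
import numpy as np

def etd0_beta_step(theta, F_prev, rho_prev, rho, phi, phi_next, reward,
                   alpha, gamma, beta):
    """One ETD(0, beta) update; returns (new theta, new trace F_t).

    Hypothetical helper: names and signature are illustrative only.
    """
    # Trace of importance-sampling ratios: F_t = beta * rho_{t-1} * F_{t-1} + 1
    F = beta * rho_prev * F_prev + 1.0
    # TD error: R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t
    td_error = reward + gamma * theta @ phi_next - theta @ phi
    # Emphatic, importance-weighted TD(0) step
    theta_new = theta + alpha * F * rho * td_error * phi
    return theta_new, F
```

Setting `beta = 0` makes the trace identically 1, which corresponds to the off-policy TD(0) special case mentioned in Remark 1.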


Remark 1. The algorithm of Sutton, Mahmood, and White (2015) selects the decay rate equal to the discount factor, i.e., $\beta = \gamma$. Here, we provide more freedom in choosing the decay rate. As our analysis reveals, the decay rate controls a bias-variance trade-off of ETD, therefore this freedom is important. Moreover, we note that for $\beta = 0$, we obtain the standard TD in an off-policy setting (Yu, 2012), and when $\beta = 1$ we obtain the full importance-sampling TD algorithm (Precup, Sutton, and Dasgupta, 2001).


^1 Namely, if $\mu(a|s) = 0$ then $\pi(a|s) = 0$ for all $s \in S$.

Remark 2. The ETD(0, $\gamma$) algorithm of Sutton, Mahmood, and White (2015) also includes a state-dependent emphasis weight $i(s)$, and a state-dependent discount factor $\gamma(s)$. Here, we analyze the case of a uniform weight $i(s) = 1$ and constant discount factor $\gamma$ for all states. While our analysis can be extended to their more general setting, the insights from the analysis remain the same, and for the purpose of clarity we chose to focus on this simpler setting.


An important term in our analysis is the emphatic weight vector $f$, defined by

$$f^\top = d_\mu^\top (I - \beta P)^{-1}. \qquad (2)$$

It can be shown (Sutton, Mahmood, and White, 2015; Yu, 2015) that ETD(0, $\beta$) converges to $\theta^*$, a solution of the following projected fixed-point equation:

$$V = \Pi_f T V, \qquad V \in \{\Phi^\top \theta : \theta \in \mathbb{R}^n\}. \qquad (3)$$
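
For a small MDP where $P$, $R$ and $d_\mu$ are known, the emphatic weights and the fixed point can be computed directly. The sketch below is illustrative only: the helper names are hypothetical, and the feature matrix is passed as `X` with one row per state (so `X` plays the role of $\Phi^\top$). It uses the standard fact that a $w$-weighted projected equation $X\theta = \Pi_w T(X\theta)$ is equivalent to the linear system $X^\top W (I - \gamma P) X \theta = X^\top W R$ when `X` has full column rank.

```python
import numpy as np

def emphatic_weights(d_mu, P, beta):
    """Solve f^T = d_mu^T (I - beta P)^{-1} for f (hypothetical helper)."""
    n = P.shape[0]
    # Transposing turns the row-vector equation into a standard linear solve.
    return np.linalg.solve((np.eye(n) - beta * P).T, d_mu)

def projected_fixed_point(X, w, P, R, gamma):
    """Return theta with X @ theta = Pi_w T(X @ theta).

    X has one row per state (X = Phi^T); w holds positive state weights.
    """
    n = P.shape[0]
    W = np.diag(w)
    A = X.T @ W @ (np.eye(n) - gamma * P) @ X
    b = X.T @ W @ R
    return np.linalg.solve(A, b)
```

Calling `projected_fixed_point(X, emphatic_weights(d_mu, P, beta), P, R, gamma)` gives the ETD(0, $\beta$) solution, while passing `d_mu` itself as the weights gives the standard off-policy TD fixed point.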


For the fixed point equation (3), a contraction property of $\Pi_f T$ is important for guaranteeing both a unique solution, and a bias bound (Bertsekas and Tsitsiklis, 1996).

It is well known that $T$ is a $\gamma$-contraction with respect to the $d_\pi$-weighted Euclidean norm (Bertsekas and Tsitsiklis, 1996), and by definition $\Pi_f$ is a non-expansion in the $f$-norm; however, it is not immediate that the composed operator $\Pi_f T$ is a contraction in any norm. Indeed, for the TD(0) algorithm (Sutton and Barto, 1998; corresponding to the $\beta = 0$ case in our setting), a similar representation as a projected Bellman operator holds, but it may be shown that in the off-policy setting the algorithm might diverge (Baird, 1995). In the next section, we study the contraction properties of $\Pi_f T$, and provide corresponding bias bounds.


3 Bias of ETD(0, $\beta$)


In this section we study the bias of the ETD(0, $\beta$) algorithm. Let us first introduce the following measure of discrepancy between the target and behavior policies:

$$\kappa = \min_s \frac{d_\mu(s)}{f(s)}.$$

Lemma 1. The measure $\kappa$ obtains values ranging from $\kappa = 0$ (when there is a state visited by the target policy, but not the behavior policy), to $\kappa = 1 - \beta$ (when the two policies are identical).


The technical proof is given in the supplementary material. The following theorem shows that for ETD(0, $\beta$) with a suitable $\beta$, the projected Bellman operator $\Pi_f T$ is indeed a contraction.

Theorem 1. For $\beta > \gamma^2(1 - \kappa)$, the projected Bellman operator $\Pi_f T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1 - \kappa)}$-contraction with respect to the Euclidean $f$-weighted norm, namely, $\forall v_1, v_2 \in \mathbb{R}^{|S|}$:

$$\|\Pi_f T v_1 - \Pi_f T v_2\|_f \le \sqrt{\frac{\gamma^2}{\beta}(1 - \kappa)}\; \|v_1 - v_2\|_f.$$
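
Proof. Let $F = \mathrm{diag}(f)$. We have

$$
\begin{aligned}
\|v\|_f^2 - \beta \|Pv\|_f^2 &= v^\top F v - \beta v^\top P^\top F P v \\
&\ge^{(a)} v^\top F v - \beta v^\top \mathrm{diag}(f^\top P) v \\
&= v^\top \left[F - \beta\, \mathrm{diag}(f^\top P)\right] v \\
&= v^\top \left[\mathrm{diag}\left(f^\top (I - \beta P)\right)\right] v \\
&=^{(b)} v^\top \mathrm{diag}(d_\mu) v = \|v\|_{d_\mu}^2,
\end{aligned}
$$

where (a) follows from the Jensen inequality:

$$
\begin{aligned}
v^\top P^\top F P v &= \sum_s f(s) \Big(\sum_{s'} P(s'|s) v(s')\Big)^2 \\
&\le \sum_s f(s) \sum_{s'} P(s'|s) v^2(s') \\
&= \sum_{s'} v^2(s') \sum_s f(s) P(s'|s) \\
&= v^\top \mathrm{diag}(f^\top P) v,
\end{aligned}
$$

and (b) is by the definition of $f$ in (2).

Notice that for every $v$:

$$\|v\|_{d_\mu}^2 = \sum_s d_\mu(s) v^2(s) \ge \kappa \sum_s f(s) v^2(s) = \kappa \|v\|_f^2.$$

Therefore:

$$
\begin{aligned}
\|v\|_f^2 &\ge \beta \|Pv\|_f^2 + \|v\|_{d_\mu}^2 \ge \beta \|Pv\|_f^2 + \kappa \|v\|_f^2 \\
\Rightarrow\quad \beta \|Pv\|_f^2 &\le (1 - \kappa) \|v\|_f^2,
\end{aligned}
$$

and:

$$
\begin{aligned}
\|T v_1 - T v_2\|_f^2 &= \|\gamma P (v_1 - v_2)\|_f^2 \\
&= \gamma^2 \|P (v_1 - v_2)\|_f^2 \\
&\le \frac{\gamma^2}{\beta}(1 - \kappa)\, \|v_1 - v_2\|_f^2.
\end{aligned}
$$

Hence, $T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1 - \kappa)}$-contraction. Since $\Pi_f$ is a non-expansion in the $f$-weighted norm (Bertsekas and Tsitsiklis, 1996), $\Pi_f T$ is a $\sqrt{\frac{\gamma^2}{\beta}(1 - \kappa)}$-contraction as well. □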

Recall that for the original ETD algorithm (Sutton, Mahmood, and White, 2015), we have that $\beta = \gamma$, and the contraction modulus is $\sqrt{\gamma(1 - \kappa)} < 1$, thus the contraction of $\Pi_f T$ always holds.
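
The bound of Theorem 1 is easy to probe numerically on a toy problem. The following sketch is only a sanity check under assumed inputs: the two-state transition matrix `P` and the vector `d_mu` are made-up values (not taken from the paper), and the helper names are hypothetical. It computes $\kappa$, the modulus $\sqrt{\gamma^2(1-\kappa)/\beta}$, and the largest observed ratio $\|\gamma P v\|_f / \|v\|_f$ over random vectors $v$.

```python
import numpy as np

def kappa_and_f(d_mu, P, beta):
    """Return (kappa, f) with f^T = d_mu^T (I - beta P)^{-1}, kappa = min_s d_mu/f."""
    n = P.shape[0]
    f = np.linalg.solve((np.eye(n) - beta * P).T, d_mu)
    return float(np.min(d_mu / f)), f

def contraction_modulus(gamma, beta, kap):
    """Modulus from Theorem 1: sqrt(gamma^2 (1 - kappa) / beta)."""
    return float(np.sqrt(gamma**2 * (1.0 - kap) / beta))

# Hypothetical 2-state example (values chosen only for illustration).
P = np.array([[0.2, 0.8], [0.6, 0.4]])   # transitions under the target policy
d_mu = np.array([0.9, 0.1])              # state distribution under the behavior policy
gamma = beta = 0.9                       # original ETD choice beta = gamma

kap, f = kappa_and_f(d_mu, P, beta)
omega = contraction_modulus(gamma, beta, kap)

# Empirical check of ||gamma P v||_f <= omega ||v||_f on random vectors.
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(1000):
    v = rng.standard_normal(2)
    ratio = np.sqrt((f @ (gamma * P @ v) ** 2) / (f @ v ** 2))
    worst = max(worst, float(ratio))
```

With these inputs, `worst` should never exceed `omega`, and `omega` stays below 1 because $d_\mu$ is strictly positive, so $\kappa > 0$.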

Also note that in the on-policy case, the behavior and target policies are equal, and according to Lemma 1 we have $1 - \kappa = \beta$. In this case, the contraction modulus in Theorem 1 is $\gamma$, similar to the result for on-policy TD (Bertsekas and Tsitsiklis, 1996).

We remark that Kolter (2011) also used a measure of discrepancy between the behavior and the target policy to bound the TD error. However, Kolter (2011) considered the standard TD algorithm, for which a contraction could be guaranteed only for a class of behavior policies that satisfy a certain matrix inequality criterion. Our results show that for ETD(0, $\beta$) with a suitable $\beta$, a contraction is guaranteed for general behavior policies. We now show in an example that our contraction modulus bounds are tight.

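Example 1. Consider an MDP with two states: Left and Right. In each state there are two identical actions leading to either Left or Right deterministically. The behavior policy will choose Right with probability $\epsilon$, and the target policy will choose Left with probability $\epsilon$, hence $1 - \kappa \approx 1$. Calculating the quantities of interest:

$$d_\mu = (1 - \epsilon,\ \epsilon)^\top, \qquad f = (I - \beta P^\top)^{-1} d_\mu = \frac{1}{1 - \beta}\,\big(1 + 2\epsilon\beta - \epsilon - \beta,\ \ -2\epsilon\beta + \epsilon + \beta\big)^\top.$$

So for $v = (0, 1)^\top$:

$$\|v\|_f^2 = \frac{\epsilon + \beta - 2\epsilon\beta}{1 - \beta}, \qquad \|Pv\|_f^2 = \frac{(1 - \epsilon)^2}{1 - \beta}, \qquad \|\gamma P v\|_f^2 = \frac{\gamma^2 (1 - \epsilon)^2}{1 - \beta},$$

and for small $\epsilon$ we obtain that $\frac{\|\gamma P v\|_f^2}{\|v\|_f^2} \approx \frac{\gamma^2}{\beta}$.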
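
The quantities in Example 1 can be reproduced numerically. The sketch below assumes the state ordering (Left, Right) and a hypothetical helper name `example1_ratio`; it builds the target-policy transition matrix and the behavior distribution described in the example, and returns $f$ together with the ratio $\|\gamma P v\|_f^2 / \|v\|_f^2$ for $v = (0,1)^\top$.

```python
import numpy as np

def example1_ratio(eps, beta, gamma):
    """Return (f, ratio) for the two-state MDP of Example 1 (illustrative helper)."""
    # Target policy moves Left w.p. eps and Right w.p. 1 - eps from either state.
    P = np.array([[eps, 1.0 - eps], [eps, 1.0 - eps]])
    # Behavior policy moves Right w.p. eps, so it sits in Left w.p. 1 - eps.
    d_mu = np.array([1.0 - eps, eps])
    f = np.linalg.solve((np.eye(2) - beta * P).T, d_mu)
    v = np.array([0.0, 1.0])
    ratio = (f @ (gamma * P @ v) ** 2) / (f @ v ** 2)
    return f, float(ratio)
```

As $\epsilon$ shrinks, the returned ratio approaches $\gamma^2/\beta$, which is the squared modulus of Theorem 1 with $\kappa \approx 0$.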

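An immediate consequence of Theorem 1 is the following error bound, based on Lemma 6.9 of Bertsekas and Tsitsiklis (1996):

Corollary 1. We have

$$
\begin{aligned}
\|\Phi^\top \theta^* - V^\pi\|_f &\le \frac{1}{\sqrt{1 - \frac{\gamma^2}{\beta}(1 - \kappa)}}\; \|\Pi_f V^\pi - V^\pi\|_f, \\
\|\Phi^\top \theta^* - V^\pi\|_{d_\mu} &\le \frac{1}{\sqrt{1 - \frac{\gamma^2}{\beta}(1 - \kappa)}}\; \|\Pi_f V^\pi - V^\pi\|_f.
\end{aligned}
$$

Up to the weights in the norm, the error $\|\Pi_f V^\pi - V^\pi\|_f$ is the best approximation we can hope for, within the capability of the linear approximation architecture. Corollary 1 guarantees that we are not too far away from it.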

Notice that the error $\|\Phi^\top \theta^* - V^\pi\|_{d_\mu}$ uses a measure $d_\mu$ which is independent of the target policy. This could be useful in further analysis of a policy iteration algorithm, which iteratively improves the target policy using samples from a single behavior policy. Such an analysis may proceed similarly to that in Munos (2003) for the on-policy case.

Figure 1: Mean squared error in value function approximation for different behavior policies.


3.1 Numerical Illustration 


We illustrate the importance of the ETD(0, $\beta$) bias bound in a numerical example. Consider the 2-state MDP example of Kolter (2011), with transition matrix $P = (1/2)\mathbf{1}$ (where $\mathbf{1}$ is an all-ones matrix), discount factor $\gamma = 0.99$, and value function $V = [1, 1.05]^\top$ (with $R = (I - \gamma P)V$). The features are $\Phi = [1, 1.05 + \epsilon]^\top$, with $\epsilon = 0.001$. Clearly, in this example we have $d_\pi = [0.5, 0.5]$. The behavior policy is chosen such that $d_\mu = [p, 1 - p]$.

In Figure 1 we plot the mean-squared error $\|\Phi^\top \theta^* - V\|_{d_\pi}$, where $\theta^*$ is either the fixed point of the standard TD equation $V = \Pi_{d_\mu} T V$, or the ETD(0, $\beta$) fixed point of (3), with $\beta = \gamma$. We also show the optimal error $\|\Pi_{d_\pi} V - V\|_{d_\pi}$ achievable with these features. Note that, as observed by Kolter (2011), for certain behavior policies the bias of standard TD is infinite. This means that algorithms that converge to this fixed point, such as the GTD algorithm (Sutton et al., 2009), are hopeless in such cases. The ETD algorithm, on the other hand, has a bounded bias for all behavior policies.
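
The setting of this illustration is small enough to solve in closed form. The following sketch is a hedged reconstruction of the experiment, not the authors' script: helper names are hypothetical, features are stored row-wise in `X`, and the reported error is the $d_\pi$-weighted norm of $X\theta^* - V$. For each $p$ it returns the scalar "key matrix" $A = X^\top W (I - \gamma P) X$ and the error, once with TD weights $W = \mathrm{diag}(d_\mu)$ and once with ETD weights $W = \mathrm{diag}(f)$.

```python
import numpy as np

gamma, eps = 0.99, 0.001
P = 0.5 * np.ones((2, 2))                 # transition matrix (1/2) * all-ones
V = np.array([1.0, 1.05])                 # true value function
R = (np.eye(2) - gamma * P) @ V           # reward consistent with V
X = np.array([[1.0], [1.05 + eps]])       # one feature per state (X = Phi^T)
d_pi = np.array([0.5, 0.5])

def solve_fixed_point(w):
    """Return (A, theta) for the w-weighted projected equation V = Pi_w T V."""
    A = X.T @ np.diag(w) @ (np.eye(2) - gamma * P) @ X
    b = X.T @ np.diag(w) @ R
    return A, np.linalg.solve(A, b)

def errors(p, beta=gamma):
    """Map 'td'/'etd' to (A, error) for behavior distribution d_mu = [p, 1 - p]."""
    d_mu = np.array([p, 1.0 - p])
    f = np.linalg.solve((np.eye(2) - beta * P).T, d_mu)   # emphatic weights
    out = {}
    for name, w in (("td", d_mu), ("etd", f)):
        A, theta = solve_fixed_point(w)
        diff = X @ theta - V
        out[name] = (float(A[0, 0]), float(np.sqrt(d_pi @ diff ** 2)))
    return out
```

In this sketch the TD quantity `A` changes sign as $p$ varies, which is where the TD fixed point blows up, whereas the ETD quantity stays positive over the whole range of $p$.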


4 The Bias-Variance Trade-Off of ETD(0, $\beta$)


From the results in Corollary 1, it is clear that increasing the decay rate $\beta$ decreases the bias bound. Indeed, for the case $\beta = 1$ we obtain the importance sampling TD algorithm (Precup, Sutton, and Dasgupta, 2001), which is known to have a bias bound similar to on-policy TD. However, as recognized by Precup, Sutton, and Dasgupta (2001) and Sutton, Mahmood, and White (2015), the importance sampling ratio $F_t$ suffers from a high variance, which increases with $\beta$. The quantity $F_t$ is important as it appears as a multiplicative factor in the definition of the ETD learning rule, so its amplitude directly impacts the stability of the algorithm. In fact, the asymptotic variance of $F_t$ may be infinite, as we show in the following example:


Example 2. Consider the same MDP given in Example 1, only now the behavior policy chooses Left or Right with probability 0.5, and the target policy chooses always Right. For ETD(0, $\beta$) with $\beta \in [0, 1)$, we have that when $S_t = \text{Left}$ then $F_t = 1$ (since $\rho_{t-1} = 0$). When $S_t = \text{Right}$, $F_t$ may take several values depending on how many steps ago, $\tau(t)$, the last transition from Left to Right occurred, i.e. $\tau(t) = \min\{i \ge 0 : S_{t-i} = \text{Left}\}$. We can write this value as $F^{\tau(t)}$ where:

$$F^{\tau} = \sum_{i=0}^{\tau} (2\beta)^i = \frac{(2\beta)^{\tau+1} - 1}{2\beta - 1},$$

if $2\beta \ne 1$. Let us assume that $2\beta > 1$ since interesting cases happen when $\beta$ is close to 1.

Let's compute $F_t$'s average over time: Following the stationary distribution of the behavior policy, $S_t = \text{Left}$ with probability 1/2. Now, conditioned on $S_t = \text{Right}$ (which happens with probability 1/2), we have $\tau(t) = i$ with probability $2^{-i-1}$. Thus the average (over time) value of $F_t$ is

$$\mathbb{E} F_t = \frac{1}{2} \sum_{i=0}^{\infty} 2^{-i-1} F^i = \frac{\frac{\beta}{1-\beta} - 1}{2(2\beta - 1)} = \frac{1}{2(1 - \beta)}.$$

Thus $F_t$ amplifies the TD update by a factor of $\frac{1}{2(1-\beta)}$ on average. Unfortunately, the actual values of the (random variable) $F_t$ do not concentrate around its expectation, and actually $F_t$ does not even have a finite variance. Indeed the average (over time) of $F_t^2$ is

$$\mathbb{E} F_t^2 = \frac{1}{2} \sum_{i=0}^{\infty} 2^{-i-1} (F^i)^2 = \sum_{i=0}^{\infty} \frac{2^{-i} \left((2\beta)^{i+1} - 1\right)^2}{4 (2\beta - 1)^2} = \infty,$$

as soon as $2\beta^2 \ge 1$.

So although ETD(0, $\beta$) converges almost surely (as shown by Yu, 2015), the variance of the estimate may be infinite, which suggests a prohibitively slow convergence rate.
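
The two series in Example 2 can be examined through their partial sums. The sketch below is an illustration with hypothetical helper names; it evaluates the weighted sums $\frac{1}{2}\sum_i 2^{-i-1} F^i$ and $\frac{1}{2}\sum_i 2^{-i-1} (F^i)^2$ truncated after `n_terms` terms, using the closed form of $F^\tau$ (so it requires $2\beta \ne 1$).

```python
def F_tau(tau, beta):
    """Closed form of sum_{i=0}^{tau} (2 beta)^i; assumes 2 * beta != 1."""
    return ((2.0 * beta) ** (tau + 1) - 1.0) / (2.0 * beta - 1.0)

def partial_moments(beta, n_terms):
    """Partial sums of the first- and second-moment series from Example 2."""
    first = 0.5 * sum(2.0 ** (-i - 1) * F_tau(i, beta) for i in range(n_terms))
    second = 0.5 * sum(2.0 ** (-i - 1) * F_tau(i, beta) ** 2 for i in range(n_terms))
    return first, second
```

For a decay rate such as $\beta = 0.9$ the first partial sum settles near $1/(2(1-\beta))$ while the second keeps growing geometrically; for $\beta = 0.6$, where $2\beta^2 < 1$, both partial sums stabilise. Keeping `n_terms` to a few hundred avoids floating-point overflow in $(2\beta)^{\tau+1}$.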

In the following proposition we characterize the dependence of the variance of $F_t$ on $\beta$.


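Proposition 1. Define the mismatch matrix $\tilde{P}_{\mu,\pi}$ such that $[\tilde{P}_{\mu,\pi}]_{\bar{s}s} = \sum_a p(s|\bar{s}, a) \frac{\pi^2(a|\bar{s})}{\mu(a|\bar{s})}$, and write $\alpha(\mu, \pi)$ for the largest magnitude of its eigenvalues. Then for any $\beta < 1/\sqrt{\alpha(\mu, \pi)}$ the average variance of $F_t$ (conditioned on any state) is finite, and

$$\sum_s d_\mu(s) \lim_{t \to \infty} \mathrm{Var}[F_t | S_t = s] \le \frac{\beta^2}{1 - \beta} \left(2 + \frac{(1 + \beta)\, \|\tilde{P}_{\mu,\pi}\|_\infty}{1 - \beta^2 \|\tilde{P}_{\mu,\pi}\|_\infty}\right),$$

where $\|\tilde{P}_{\mu,\pi}\|_\infty$ is the $\ell_\infty$-induced norm, which is the maximum absolute row sum of the matrix.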


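Proof. (Partial) Following the same derivation that Sutton, Mahmood, and White (2015) used to prove that $f(s) = d_\mu(s) \lim_{t \to \infty} \mathbb{E}[F_t | S_t = s]$, we have

$$
\begin{aligned}
q(s) &:= d_\mu(s) \lim_{t \to \infty} \mathbb{E}[F_t^2 | S_t = s] \\
&= d_\mu(s) \lim_{t \to \infty} \mathbb{E}[(1 + \rho_{t-1} \beta F_{t-1})^2 | S_t = s] \\
&= d_\mu(s) \lim_{t \to \infty} \mathbb{E}[1 + 2 \rho_{t-1} \beta F_{t-1} + \rho_{t-1}^2 \beta^2 F_{t-1}^2 | S_t = s].
\end{aligned}
$$

For the first summand, we get $d_\mu(s)$. For the second summand, we get:

$$2\beta\, d_\mu(s) \lim_{t \to \infty} \mathbb{E}[\rho_{t-1} F_{t-1} | S_t = s] = 2\beta \sum_{\bar{s}} [P_\pi]_{\bar{s}s} f(\bar{s}).$$

The third summand equals

$$\beta^2 \sum_{\bar{s}, a} d_\mu(\bar{s}) \mu(a|\bar{s}) p(s|\bar{s}, a) \frac{\pi^2(a|\bar{s})}{\mu^2(a|\bar{s})} \lim_{t \to \infty} \mathbb{E}[F_{t-1}^2 | S_{t-1} = \bar{s}] = \beta^2 \sum_{\bar{s}} [\tilde{P}_{\mu,\pi}]_{\bar{s}s}\, q(\bar{s}).$$

Hence $q = d_\mu + 2\beta P_\pi^\top f + \beta^2 \tilde{P}_{\mu,\pi}^\top q$. Thus for any $\beta < 1/\sqrt{\alpha(\mu, \pi)}$, all eigenvalues of the matrix $\beta^2 \tilde{P}_{\mu,\pi}^\top$ have magnitude smaller than 1, and the vector $q$ has finite components. The rest of the proof is very technical and is given in Lemma 2 in the supplementary material. □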

Proposition 1 and Corollary 1 show that the decay rate $\beta$ acts as an implicit trade-off parameter between the bias and variance in ETD. For large $\beta$, we have a low bias but suffer from a high variance (possibly infinite if $\beta \ge 1/\sqrt{\alpha(\mu, \pi)}$), and vice versa for small $\beta$. Notice that for the on-policy case, $\alpha(\mu, \pi) = 1$, thus for any $\beta < 1$ the variance is finite.
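
The threshold on $\beta$ can be computed for a concrete pair of policies. The sketch below assumes that the mismatch matrix has entries $\sum_a p(s|\bar{s},a)\,\pi^2(a|\bar{s})/\mu(a|\bar{s})$, which is the form suggested by the third summand in the proof above; the array layout (`p[s_bar, a, s]`, policies as `[s_bar, a]`) and the helper names are choices made here for illustration. It is applied to the MDP of Example 2.

```python
import numpy as np

def mismatch_matrix(p, pi, mu):
    """Entries sum_a p[s_bar, a, s] * pi[s_bar, a]^2 / mu[s_bar, a].

    Assumes mu > 0 for every state-action pair (illustrative helper).
    """
    return np.einsum("ia,iaj->ij", pi ** 2 / mu, p)

def alpha_spectral(p, pi, mu):
    """Largest eigenvalue magnitude of the mismatch matrix."""
    eigvals = np.linalg.eigvals(mismatch_matrix(p, pi, mu))
    return float(np.max(np.abs(eigvals)))

# Example 2: states (Left, Right), actions (Left, Right), deterministic moves.
p = np.zeros((2, 2, 2))
p[:, 0, 0] = 1.0                            # action Left  -> state Left
p[:, 1, 1] = 1.0                            # action Right -> state Right
mu = np.full((2, 2), 0.5)                   # behavior: uniform over actions
pi = np.array([[0.0, 1.0], [0.0, 1.0]])     # target: always Right

alpha = alpha_spectral(p, pi, mu)
beta_threshold = 1.0 / np.sqrt(alpha)
```

For Example 2 this gives $\alpha = 2$, so the variance is guaranteed finite only for $\beta < 1/\sqrt{2}$, in agreement with the divergence condition $2\beta^2 \ge 1$ found in that example. Passing the same policy for both `pi` and `mu` gives $\alpha = 1$, the on-policy case.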

Originally, ETD(0, $\beta$) was introduced with $\beta = \gamma$, and from our perspective, it may be seen as a specific choice for the bias-variance trade-off. However, there is no intrinsic reason to choose $\beta = \gamma$, and other choices may be preferred in practice, depending on the nature of the problem. In the following numerical example, we investigate the bias-variance dependence on $\beta$, and show that the optimal $\beta$ in terms of mean-squared error may be quite different from $\gamma$.


4.1 Numerical Illustration 

We revisit the 2-state MDP described in Section 3.1, with $\gamma = 0.9$, $\epsilon = 0.2$ and $p = 0.95$. For these parameter settings, the error of standard TD is 42.55 ($p$ was chosen to be close to a point of infinite bias for these parameters).

In Figure 2 we plot the mean-squared error $\|\Phi^\top \theta^* - V\|_{d_\pi}$, where $\theta^*$ was obtained by running ETD(0, $\beta$) with a step size $\alpha = 0.001$ for 10,000 iterations, and averaging the results over 10,000 different runs.

Figure 2: Mean squared error in value function approximation for different decay rates.


First of all, note that for all $\beta$, the error is smaller by two orders of magnitude than that of standard TD. Thus, algorithms that converge to the standard TD fixed point, such as GTD (Sutton et al., 2009), are significantly outperformed by ETD(0, $\beta$) in this case. Second, note the dependence of the error on $\beta$, demonstrating the bias-variance trade-off discussed above. Finally, note that the minimal error is obtained for $\beta = 0.8$, and is considerably smaller than that of the original ETD with $\beta = \gamma = 0.9$.


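5 Contraction Property for ETD($\lambda$, $\beta$)


We now extend our results to incorporate eligibility traces, in the style of the ETD($\lambda$) algorithm (Sutton, Mahmood, and White, 2015), and show similar contraction properties and error bounds.

The ETD($\lambda$, $\beta$) algorithm iteratively updates the weight vector $\theta$ according to

$$
\begin{aligned}
\theta_{t+1} &:= \theta_t + \alpha \left(R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right) e_t, \\
e_t &= \rho_t \left(\gamma \lambda e_{t-1} + M_t \phi_t\right), \quad e_{-1} = 0, \\
M_t &= \lambda + (1 - \lambda) F_t, \\
F_t &= \beta \rho_{t-1} F_{t-1} + 1, \quad F_0 = 1,
\end{aligned}
$$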


where $e_t$ is the eligibility trace (Sutton, Mahmood, and White, 2015). In this case, we define the emphatic weight vector $m$ by

$$m^\top = d_\mu^\top (I - P^{\lambda,\beta})^{-1}, \qquad (4)$$

where $P^{a,b}$, for some $a, b \in \mathbb{R}$, denotes the following matrix:

$$P^{a,b} = I - (I - abP)^{-1}(I - bP).$$
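
The ETD($\lambda$, $\beta$) updates listed above can be sketched as a single step function. As with any such sketch, the name `etd_lambda_beta_step` and its argument layout are assumptions made for illustration; the caller is expected to start with `e_prev` equal to the zero vector and may pass `rho_prev = 0` on the first call so that the recursion yields $F_0 = 1$.

```python
import numpy as np

def etd_lambda_beta_step(theta, e_prev, F_prev, rho_prev, rho, phi, phi_next,
                         reward, alpha, gamma, lam, beta):
    """One ETD(lambda, beta) update; returns (theta, e_t, F_t). Illustrative only."""
    # Decayed trace of importance-sampling ratios.
    F = beta * rho_prev * F_prev + 1.0
    # Emphasis mixes the constant 1 (weight lambda) with F_t (weight 1 - lambda).
    M = lam + (1.0 - lam) * F
    # Eligibility trace.
    e = rho * (gamma * lam * e_prev + M * phi)
    # TD error and parameter update.
    td_error = reward + gamma * theta @ phi_next - theta @ phi
    theta_new = theta + alpha * td_error * e
    return theta_new, e, F
```

With `lam = 0` the trace reduces to $e_t = \rho_t F_t \phi_t$ and the step coincides with the ETD(0, $\beta$) update; with `lam = 1` the emphasis $M_t$ equals 1 and $\beta$ no longer influences the update.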


The Bellman operator for general $\lambda$ and $\gamma$ is given by:

$$T^{(\lambda)}(V) = (I - \gamma\lambda P)^{-1} R + P^{\lambda,\gamma} V, \qquad V \in \mathbb{R}^{|S|}.$$

For $\lambda = 0$ we have $P^{\lambda,\beta} = \beta P$, $P^{\lambda,\gamma} = \gamma P$, and $m = f$, so we recover the definitions of ETD(0, $\beta$).
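
These matrix definitions are straightforward to evaluate for a small chain. The sketch below assumes the readings $P^{a,b} = I - (I - abP)^{-1}(I - bP)$ and $m^\top = d_\mu^\top (I - P^{\lambda,\beta})^{-1}$; the helper names are hypothetical.

```python
import numpy as np

def P_ab(P, a, b):
    """P^{a,b} = I - (I - a b P)^{-1} (I - b P)."""
    I = np.eye(P.shape[0])
    # solve(A, B) computes A^{-1} B without forming the inverse explicitly.
    return I - np.linalg.solve(I - a * b * P, I - b * P)

def emphatic_weights_lambda(d_mu, P, lam, beta):
    """Solve m^T = d_mu^T (I - P^{lam,beta})^{-1} for m."""
    I = np.eye(P.shape[0])
    return np.linalg.solve((I - P_ab(P, lam, beta)).T, d_mu)
```

A quick consistency check is that `P_ab(P, 0, beta)` equals `beta * P`, and that every row of `P_ab(P, lam, beta)` sums to $\beta(1-\lambda)/(1-\lambda\beta)$, the constant $\zeta$ that appears in the proof of Theorem 2.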

Recall that our goal is to estimate the value function $V^\pi$. Thus, we would like to know how well the ETD($\lambda$, $\beta$) solution approximates $V^\pi$. Mahmood et al. (2015) show that, under suitable step-size conditions, ETD converges to some $\theta^*$ that is a solution of the projected fixed-point equation:

$$\Phi^\top \theta = \Pi_m T^{(\lambda)}(\Phi^\top \theta).$$

In their analysis, however, Mahmood et al. (2015) did not show how well the solution $\Phi^\top \theta^*$ approximates $V^\pi$. Next, we establish that the projected Bellman operator $\Pi_m T^{(\lambda)}$ is a contraction. This result will then allow us to bound the error $\|\Phi^\top \theta^* - V^\pi\|_m$.

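Theorem 2. $\Pi_m T^{(\lambda)}$ is an $\omega$-contraction with respect to the Euclidean $m$-weighted norm where:

$$
\omega = \begin{cases}
\sqrt{\dfrac{\gamma^2 (1 + \lambda\beta)^2 (1 - \lambda)}{\beta (1 + \gamma\lambda)^2 (1 - \lambda\beta)}}, & \beta \ge \gamma, \\[2ex]
\sqrt{\dfrac{\gamma^2 (1 - \beta\lambda)(1 - \lambda)}{\beta (1 - \gamma\lambda)^2}}, & \beta < \gamma.
\end{cases}
\qquad (5)
$$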
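
The modulus in (5) is simple to tabulate. The function below is a small sketch that assumes the two-case reading of (5), with $\beta \ge \gamma$ and $\beta < \gamma$ as the cases; the function name is hypothetical. Values of 1 or more mean that the bound does not certify a contraction for that parameter choice.

```python
import math

def contraction_modulus(gamma, beta, lam):
    """Modulus omega from (5) for ETD(lambda, beta); illustrative helper."""
    if beta >= gamma:
        sq = (gamma**2 * (1 + lam * beta)**2 * (1 - lam)
              / (beta * (1 + gamma * lam)**2 * (1 - lam * beta)))
    else:
        sq = (gamma**2 * (1 - beta * lam) * (1 - lam)
              / (beta * (1 - gamma * lam)**2))
    return math.sqrt(sq)
```

Evaluating this over a grid of $\lambda \in [0, 1]$ for a few values of $\beta$ gives the kind of curves described for Figure 3; at $\lambda = 1$ the modulus is 0 for any $\beta \in (0, 1)$.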


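Proof. (sketch) The proof is almost identical to the proof of Theorem 1, only now we cannot apply Jensen's inequality directly, since the rows of $P^{\lambda,\beta}$ do not sum to 1. However:

$$P^{\lambda,\beta} \mathbf{1} = \left(I - (I - \lambda\beta P)^{-1}(I - \beta P)\right) \mathbf{1} = \zeta \mathbf{1},$$

where $\zeta = \frac{\beta(1 - \lambda)}{1 - \lambda\beta}$. Notice that each entry of $P^{\lambda,\beta}$ is positive. Therefore Jensen's inequality holds for $P^{\lambda,\beta}/\zeta$. Let $M = \mathrm{diag}(m)$; we have

$$
\begin{aligned}
\|v\|_m^2 - \frac{1}{\zeta} \|P^{\lambda,\beta} v\|_m^2 &= v^\top M v - \zeta\, v^\top \frac{(P^{\lambda,\beta})^\top}{\zeta} M \frac{P^{\lambda,\beta}}{\zeta} v \\
&\ge^{(a)} v^\top M v - \zeta\, v^\top \mathrm{diag}\Big(m^\top \frac{P^{\lambda,\beta}}{\zeta}\Big) v \\
&= v^\top \left[M - \mathrm{diag}(m^\top P^{\lambda,\beta})\right] v \\
&= v^\top \left[\mathrm{diag}\left(m^\top (I - P^{\lambda,\beta})\right)\right] v \\
&=^{(b)} v^\top \mathrm{diag}(d_\mu) v = \|v\|_{d_\mu}^2,
\end{aligned}
$$

where (a) follows from the Jensen inequality and (b) from Equation (4). Therefore:

$$\|v\|_m^2 - \frac{1}{\zeta} \|P^{\lambda,\beta} v\|_m^2 \ge \|v\|_{d_\mu}^2 \ge 0 \quad\Rightarrow\quad \|P^{\lambda,\beta} v\|_m^2 \le \zeta \|v\|_m^2,$$

and:

$$
\begin{aligned}
\|T^{(\lambda)} v_1 - T^{(\lambda)} v_2\|_m^2 &= \|P^{\lambda,\gamma}(v_1 - v_2)\|_m^2 \\
\text{(Case A: } \beta \ge \gamma\text{)} \quad &\le \frac{\gamma^2 (1 + \lambda\beta)^2}{\beta^2 (1 + \gamma\lambda)^2} \|P^{\lambda,\beta}(v_1 - v_2)\|_m^2 \\
&\le \frac{\gamma^2 (1 + \lambda\beta)^2 (1 - \lambda)}{\beta (1 + \gamma\lambda)^2 (1 - \lambda\beta)} \|v_1 - v_2\|_m^2, \\
\text{(Case B: } \beta < \gamma\text{)} \quad &\le \frac{\gamma^2 (1 - \beta\lambda)^2}{\beta^2 (1 - \gamma\lambda)^2} \|P^{\lambda,\beta}(v_1 - v_2)\|_m^2 \\
&\le \frac{\gamma^2 (1 - \beta\lambda)(1 - \lambda)}{\beta (1 - \gamma\lambda)^2} \|v_1 - v_2\|_m^2.
\end{aligned}
$$

The inequalities depending on the two cases originate from the fact that the two matrices $P^{\lambda,\beta}$ and $P^{\lambda,\gamma}$ are polynomials of the same matrix $P$, and from mathematical manipulation of the corresponding eigenvalue decomposition of $(v_1 - v_2)$. The details are given in Lemma 3 of the supplementary material.

Now, for a proper choice of $\beta$, the operator $T^{(\lambda)}$ is a contraction, and since $\Pi_m$ is a non-expansion in the $m$-weighted norm, $\Pi_m T^{(\lambda)}$ is a contraction as well. □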

In Figure 3 we illustrate the dependence of the contraction moduli bound on $\lambda$ and $\beta$. In particular, for $\lambda \to 1$, the contraction modulus diminishes to 0. Thus, for large enough $\lambda$, a contraction can always be guaranteed (this can also be shown mathematically from the contraction results of Theorem 2). We remark that a similar result for standard TD($\lambda$) was established by Yu (2012). However, as is well-known (Bertsekas, 2012), increasing $\lambda$ also increases the variance of the algorithm, and we therefore obtain a bias-variance trade-off in $\lambda$ as well as $\beta$. Finally, note that for $\beta = \gamma$, the contraction modulus equals

$$\omega = \sqrt{\frac{\gamma(1 - \lambda)}{1 - \gamma\lambda}},$$

and that for $\lambda = 0$ the result is the same as in Theorem 1.


6 Conclusion 

In this work we unified several off-policy TD algorithms under the ETD($\lambda$, $\beta$) framework, which flexibly manages the bias and variance of the algorithm by controlling the decay rate of the importance-sampling ratio. From this perspective, we showed that several different methods proposed in the literature are special instances of this bias-variance selection.

Our main contribution is an error analysis of ETD($\lambda$, $\beta$) that quantifies the bias-variance trade-off. In particular, we showed that the recently proposed ETD algorithm of Sutton, Mahmood, and White (2015) has bounded bias for general behavior and target policies, and that by controlling the decay rate in the ETD($\lambda$, $\beta$) algorithm, improved performance may be obtained by reducing the variance of the algorithm while still maintaining a reasonable bias.

Figure 3: Contraction moduli of $\Pi_m T^{(\lambda)}$ for different $\beta$'s, as a function of the bootstrapping parameter $\lambda$. Notice that we see a steep decrease in the moduli only for $\lambda$ close to 1.


Possible future extensions of our work include finite-time bounds for off-policy ETD($\lambda$, $\beta$), an error propagation analysis of off-policy policy improvement, and solving the bias-variance trade-off adaptively from data.

A Proof of Lemma 1


Notice that $\kappa$ obtains non-negative values since $d_\mu(s), f(s) \ge 0$. Now, if there is a state $s$ visited by the target policy, but not the behavior policy, this means that $d_\mu(s) = 0$, and that there is some $t$ such that $[d_\mu^\top P^t](s) > 0$, and by definition $f(s) \ge \beta^t [d_\mu^\top P^t](s)$, so we get $\kappa = 0$.

Next, we prove the upper bound on $\kappa$. Notice that $f(s) \ge 0$, and that $\sum_s f(s) = 1/(1 - \beta)$. Hence, if $d_\mu \ne (1 - \beta) f$, then there must exist some $s$ such that $d_\mu(s) < (1 - \beta) f(s)$, so $\kappa < 1 - \beta$. Now, when $d_\mu = d_\pi$, by definition $d_\mu = (1 - \beta) f$ and we obtain this upper bound.


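B Technical Part of Proposition 1

Lemma 2. The following is true:

$$\sum_s d_\mu(s) \lim_{t \to \infty} \mathrm{Var}[F_t | S_t = s] \le \frac{\beta^2}{1 - \beta} \left(2 + \frac{(1 + \beta)\, \|\tilde{P}_{\mu,\pi}\|_\infty}{1 - \beta^2 \|\tilde{P}_{\mu,\pi}\|_\infty}\right).$$

Proof. Notice that:

$$f^\top = d_\mu^\top (I - \beta P_\pi)^{-1} \ge d_\mu^\top + \beta d_\mu^\top P_\pi,$$

so, writing $D_\mu = \mathrm{diag}(d_\mu)$:

$$
\begin{aligned}
\sum_s d_\mu(s) \lim_{t \to \infty} \mathrm{Var}[F_t | S_t = s] &= q^\top \mathbf{1} - f^\top D_\mu^{-1} f \\
&\le^{(a)} d_\mu^\top \mathbf{1} + 2\beta f^\top P_\pi \mathbf{1} + (d_\mu^\top + 2\beta f^\top P_\pi)\, \beta^2 \tilde{P}_{\mu,\pi} (I - \beta^2 \tilde{P}_{\mu,\pi})^{-1} \mathbf{1} \\
&\qquad - (d_\mu + \beta P_\pi^\top d_\mu)^\top D_\mu^{-1} (d_\mu + \beta P_\pi^\top d_\mu) \\
&\le^{(b)} \Big(1 + \frac{2\beta}{1 - \beta}\Big) - (1 + 2\beta) + \left\|(d_\mu^\top + 2\beta f^\top P_\pi)\, \beta^2 \tilde{P}_{\mu,\pi} (I - \beta^2 \tilde{P}_{\mu,\pi})^{-1}\right\|_1 \\
&\le^{(c)} \frac{2\beta^2}{1 - \beta} + \beta^2 \left\|d_\mu^\top + 2\beta f^\top P_\pi\right\|_1 \left\|\tilde{P}_{\mu,\pi} (I - \beta^2 \tilde{P}_{\mu,\pi})^{-1}\right\|_\infty \\
&\le^{(d)} \frac{\beta^2}{1 - \beta} \left(2 + \frac{(1 + \beta)\, \|\tilde{P}_{\mu,\pi}\|_\infty}{1 - \beta^2 \|\tilde{P}_{\mu,\pi}\|_\infty}\right),
\end{aligned}
$$

where (a) comes from the inequality on $f$, (b) also removes the negative summand $-\beta^2 d_\mu^\top P_\pi D_\mu^{-1} P_\pi^\top d_\mu$ and swaps the sum with the $\ell_1$ norm (all coordinates are non-negative), and (c) and (d) are from the sub-multiplicative property of induced norms (the $\ell_\infty$ norm originates from the transpose). □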
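
C Norm Inequality between $P^{\lambda,\gamma}$ and $P^{\lambda,\beta}$

Lemma 3. If $\beta \ge \gamma$:

$$\|P^{\lambda,\gamma} v\|_m^2 \le \frac{\gamma^2 (1 + \beta\lambda)^2}{\beta^2 (1 + \gamma\lambda)^2} \|P^{\lambda,\beta} v\|_m^2, \qquad (6)$$

and if $\beta \le \gamma$:

$$\|P^{\lambda,\gamma} v\|_m^2 \le \frac{\gamma^2 (1 - \beta\lambda)^2}{\beta^2 (1 - \gamma\lambda)^2} \|P^{\lambda,\beta} v\|_m^2. \qquad (7)$$

Proof. Denote the eigenvectors of $P_\pi$, orthonormal w.r.t. $m$, and the corresponding eigenvalues by $u_j$, $t_j$ respectively ($t_j$ may be a complex number; this decomposition exists over $\mathbb{C}$ almost surely). Notice that since $P^{\lambda,\beta}$ and $P^{\lambda,\gamma}$ are polynomials of $P_\pi$ they have the same eigenvectors, with the eigenvalues $l_j^\beta := \frac{\beta(1 - \lambda) t_j}{1 - \lambda\beta t_j}$ and $l_j^\gamma := \frac{\gamma(1 - \lambda) t_j}{1 - \lambda\gamma t_j}$ correspondingly. Hence, we can write the first norm as follows:

$$
\begin{aligned}
\|P^{\lambda,\gamma} v\|_m^2 &= \Big\|\sum_j \langle u_j, v \rangle P^{\lambda,\gamma} u_j\Big\|_m^2 \\
&= \Big\|\sum_j \langle u_j, v \rangle\, l_j^\gamma u_j\Big\|_m^2 \\
&= \sum_j |\langle u_j, v \rangle|^2\, |l_j^\gamma|^2\, \|u_j\|_m^2.
\end{aligned}
\qquad (8)
$$

And similarly for $\beta$:

$$\|P^{\lambda,\beta} v\|_m^2 = \sum_j |\langle u_j, v \rangle|^2\, |l_j^\beta|^2\, \|u_j\|_m^2. \qquad (9)$$

So if we can find a constant $a$ such that:

$$|l_j^\gamma|^2 \le a^2\, |l_j^\beta|^2 \quad \forall j, \qquad (10)$$

then we could swap $\|P^{\lambda,\gamma} v\|_m \le a \|P^{\lambda,\beta} v\|_m$. The expression we want to maximize is:

$$\frac{|l_j^\gamma|^2}{|l_j^\beta|^2} = \frac{\gamma^2 (1 - \beta\lambda t_j)(1 - \beta\lambda t_j^*)}{\beta^2 (1 - \gamma\lambda t_j)(1 - \gamma\lambda t_j^*)} = \frac{\gamma^2 \left(1 - \beta\lambda t_j - \beta\lambda t_j^* + \beta^2\lambda^2 |t_j|^2\right)}{\beta^2 \left(1 - \gamma\lambda t_j - \gamma\lambda t_j^* + \gamma^2\lambda^2 |t_j|^2\right)}. \qquad (11)$$

Taking the derivative with respect to $\mathrm{Re}\{t_j\}$ shows that there are no extrema points inside the ball $|t_j| < 1$ (we know the eigenvalues are inside this ball since they belong to a stochastic matrix), which means we can look at the boundary of this ball, $|t_j| = 1$, to find the maximum value. Since now we get dependence only on $\mathrm{Re}\{t_j\}$, the maximum must be at $t_j = \pm 1$:

$$\max_{t_j} \frac{|l_j^\gamma|^2}{|l_j^\beta|^2} = \frac{\gamma^2 (1 \pm \beta\lambda)^2}{\beta^2 (1 \pm \gamma\lambda)^2}, \qquad (12)$$

where when $\beta > \gamma$ the plus is larger and vice versa.

□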
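
The maximisation step can be spot-checked by sampling points of the unit disc. The sketch below is only a numerical sanity check: the eigenvalue expression $l^b = b(1-\lambda)t / (1 - \lambda b t)$ is derived here from the definition of $P^{a,b}$ and should be treated as an assumption, as are the helper names and the sampled parameter values.

```python
import numpy as np

def eig_ratio(t, gamma, beta, lam):
    """|l^gamma|^2 / |l^beta|^2 for an eigenvalue t of P (t != 0, lam < 1)."""
    l_gamma = gamma * (1 - lam) * t / (1 - lam * gamma * t)
    l_beta = beta * (1 - lam) * t / (1 - lam * beta * t)
    return np.abs(l_gamma) ** 2 / np.abs(l_beta) ** 2

def ratio_bound(gamma, beta, lam):
    """Right-hand side of (12): plus sign if beta >= gamma, minus sign otherwise."""
    sign = 1.0 if beta >= gamma else -1.0
    return (gamma**2 * (1 + sign * beta * lam) ** 2
            / (beta**2 * (1 + sign * gamma * lam) ** 2))

# Random points in the unit disc (radius kept away from 0 to avoid 0/0).
rng = np.random.default_rng(0)
radius = np.sqrt(rng.uniform(1e-6, 1.0, 5000))
angle = rng.uniform(0.0, 2.0 * np.pi, 5000)
t = radius * np.exp(1j * angle)
```

For a given triple $(\gamma, \beta, \lambda)$, every value of `eig_ratio(t, gamma, beta, lam)` should lie below `ratio_bound(gamma, beta, lam)`, with equality approached at $t = -1$ when $\beta \ge \gamma$ and at $t = +1$ when $\beta < \gamma$.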